Automatic Acquisition of Linguistic Knowledge: From Sinica Corpus to Gigaword Corpus

Author

  • Chu-Ren Huang
Abstract

The raison d'être for a corpus, as first conceived by Francis and Kucera in 1963, was to provide a body of linguistic facts from which linguistic knowledge could be generalized [1]. The methods of acquisition have evolved as corpus size and technology have advanced over the past 40 years. Originally, corpus-based concordances helped linguists form generalizations; this is what Fillmore [2] characterized as a ‘computer-aided armchair linguist’. Today, direct, automatic acquisition of linguistic knowledge from a corpus is becoming a reality. Two trends are critical to this shift: the increase in corpus size and the development of technology to extract linguistic relations. The release of the Chinese Gigaword Corpus [3] by LDC in 2004 set the stage for billion-word (1,000,000,000-word) corpora, while the development of the Sketch Engine by Adam Kilgarriff and colleagues [4] in the same year provided tools for the automatic acquisition of linguistic knowledge. Unlike the balanced corpus tradition established by the Brown Corpus and adopted by the Sinica Corpus (1995, the first annotated Chinese corpus) [5], the Gigaword Corpus has a uniform data source. It consists of two sub-corpora: one from the Central News Agency in Taiwan and the other from the Xinhua News Agency in the PRC. In other words, the Gigaword Corpus is a gargantuan news corpus representing the two major variants of Mandarin Chinese. The sheer size of the Gigaword Corpus poses both a challenge and an opportunity. The challenge lies in achieving high-quality corpus annotation with a minimum of human intervention. The current standard procedure for corpus annotation, especially POS tagging, is automatic tagging followed by human post-editing. Adopting that post-editing procedure for the Gigaword Corpus is impractical because of the scale of the undertaking.
Instead, we apply the Academia Sinica tagging system developed for the construction of the Sinica Corpus, with the statistical model trained on its complete 5-million-word corpus. In addition, lexicon adaptation, unknown word detection, and feedback modules are implemented. The result is a truly automatic, highly efficient annotation program that creates a fully tagged Gigaword Corpus. The Gigaword Corpus also opens up the possibility of automatically extracting grammatical relations. Automatic assignment of syntactic structure is a difficult NLP task, and a precise structural assignment for a specific sentence or construction, at the level a linguist would desire, is still impossible. However, the resulting local errors become negligible when grammatical relations are extracted from significant patterns across a large number of examples. This is the design criterion of the Sketch Engine (Kilgarriff et al. [4]), and the same criterion has been applied to the annotated Gigaword Corpus in order to construct the Chinese Sketch Engine. Our early experiments showed that the grammatical information extracted is generally reliable, although the acquired information must still be interpreted with the aid of linguistic expertise. In conclusion, recent developments in corpus linguistics clearly point toward billion-word size, fully automatic annotation, and automatic acquisition of linguistic knowledge. These developments will shape the construction of future corpora.


Related articles

Quality Assurance of Automatic Annotation of Very Large Corpora: a Study based on heterogeneous Tagging System

We propose a set of heuristics for improving annotation quality of very large corpora efficiently. The Xinhua News portion of the Chinese Gigaword Corpus was tagged independently with both the Peking University ICL tagset and the Academia Sinica CKIP tagset. The corpus-based POS tags mapping will serve as the basis of the possible contrast in grammatical systems between PRC and Taiwan. And it c...


The Hungarian Gigaword Corpus

The paper reports on the development of the Hungarian Gigaword Corpus, an extended new edition of the Hungarian National Corpus, with upgraded and redesigned linguistic annotation and an increased size of 1.5 billion tokens. Issues concerning the standard steps of corpus collection and preparation are discussed with special emphasis on linguistic analysis and annotation due to Hungarian having ...


Using Chinese Gigaword Corpus and Chinese Word Sketch in linguistic Research

We explore the possibility of deeper linguistic research based on corpus and computational linguistic tools in this paper. In particular, we adopt Chinese Word Sketch, the application of Word Sketch Engine to Chinese GigaWord Corpus, for linguistic research. We apply Chinese Sketch Engine results to deeper linguistic account such as selectional restriction and event type selection. The study is...


A Corpus-based Study of Lexical Bundles in Discussion Section of Medical Research Articles

There has been increasing interest in utilizing corpora in linguistic research and pedagogy in recent years. Rhetorical organization of different sections of research articles may appear similar in various disciplines, but close examination may show subtle differences nonetheless. One of the features that has been at the center of attention especially in recent years is the idiomaticity of a di...


Allophone-based acoustic modeling for Persian phoneme recognition

Phoneme recognition is one of the fundamental phases of automatic speech recognition. Coarticulation which refers to the integration of sounds, is one of the important obstacles in phoneme recognition. In other words, each phone is influenced and changed by the characteristics of its neighbor phones, and coarticulation is responsible for most of these changes. The idea of modeling the effects o...




Publication date: 2006